Covered
Why graphics?
Rules of good graphics
Some bad graphics
Assumption testing: iterative process
If unable to transform: non-parametric approach
When assumptions are violated, we can:
1. Transform data
2. Use robust methods
3. Use non-parametric tests
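The three options above can be sketched on simulated data. Everything in this example is an assumption for illustration: the data frame `sim`, its groups, and its parameters are made up, and Welch's t-test (which relaxes the equal-variance assumption) stands in for the "robust methods" option.

```r
# Simulated, right-skewed masses for two hypothetical groups (not the lecture data)
set.seed(42)
sim <- data.frame(
  group = rep(c("A", "B"), each = 40),
  mass  = c(rlnorm(40, meanlog = 5.0, sdlog = 0.8),
            rlnorm(40, meanlog = 5.5, sdlog = 0.8))
)

# 1. Transform: a log transform often reduces right skew
fit_log <- t.test(log(mass) ~ group, data = sim, var.equal = TRUE)

# 2. Robust: Welch's t-test does not assume equal variances
fit_welch <- t.test(mass ~ group, data = sim)

# 3. Non-parametric: Wilcoxon rank-sum (Mann-Whitney) test
fit_wilcox <- wilcox.test(mass ~ group, data = sim)

# Compare the p-values from the three approaches
c(log_t = fit_log$p.value,
  welch = fit_welch$p.value,
  wilcoxon = fit_wilcox$p.value)
```

Which option is appropriate depends on which assumption failed; the sketch only shows how each one is called.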
Shapiro-Wilk normality test
data: trout_data$mass_g
W = 0.87436, p-value < 2.2e-16
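Output of this form is what `shapiro.test()` prints. As a minimal sketch of how the test behaves, the example below runs it on simulated right-skewed values and again on their logarithms; `sim_mass` is made-up data, not the trout measurements.

```r
# Simulated right-skewed "masses" (hypothetical data, log-normal by construction)
set.seed(1)
sim_mass <- rlnorm(200, meanlog = 5, sdlog = 1)

# Shapiro-Wilk test on the raw values and on the log-transformed values
raw_test <- shapiro.test(sim_mass)
log_test <- shapiro.test(log(sim_mass))

# W closer to 1 indicates closer agreement with a normal distribution
c(W_raw = unname(raw_test$statistic), W_log = unname(log_test$statistic))
c(p_raw = raw_test$p.value, p_log = log_test$p.value)
```

A small p-value is evidence against normality. With large samples the test can flag even minor departures, so it is usually read alongside a histogram or Q-Q plot.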
# Testing equality of variances across sampling sites
# First create a model
mice_model <- lm(mass_g ~ sampling_site, data = trout_data)
# Then test for homogeneity of variances
car::leveneTest(mice_model)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)
group   1  26.352 3.911e-07 ***
      569
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
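To make the median-centered Levene's test less of a black box, here is a sketch of the same idea in base R: a one-way ANOVA on the absolute deviations of each observation from its group median. The data frame `sim` and its two sites are simulated for illustration and are not the lecture data.

```r
# Two hypothetical sites with equal means but very different spread
set.seed(7)
sim <- data.frame(
  site = factor(rep(c("A", "B"), each = 50)),
  mass = c(rnorm(50, mean = 100, sd = 5),
           rnorm(50, mean = 100, sd = 25))
)

# Absolute deviation of each observation from its own group median
sim$abs_dev <- abs(sim$mass - ave(sim$mass, sim$site, FUN = median))

# One-way ANOVA on the absolute deviations:
# a significant F means the groups differ in spread
levene_by_hand <- anova(lm(abs_dev ~ site, data = sim))
levene_by_hand
```

A significant result, as in the trout output above, suggests the equal-variance assumption does not hold.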
# Function to calculate family-wise error rate
family_wise_error <- function(alpha_per_test, num_tests) {
  1 - (1 - alpha_per_test)^num_tests
}
# Create a data frame of family-wise error rates
error_rates <- tibble(
  num_tests = c(1, 2, 5, 10, 20, 50, 100),
  error_rate = family_wise_error(0.05, num_tests)
)
error_rates
# A tibble: 7 × 2
  num_tests error_rate
      <dbl>      <dbl>
1         1     0.0500
2         2     0.0975
3         5     0.226
4        10     0.401
5        20     0.642
6        50     0.923
7       100     0.994
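The same formula can be turned around to ask what per-test threshold keeps the family-wise rate at 0.05. The sketch below compares the Bonferroni threshold (alpha divided by the number of tests) with the Šidák threshold, which inverts the formula exactly. Like the function above, it assumes the tests are independent; the variable names are illustrative.

```r
alpha_family <- 0.05
num_tests <- c(1, 2, 5, 10, 20, 50, 100)

# Per-test significance thresholds under each correction
bonferroni_alpha <- alpha_family / num_tests
sidak_alpha <- 1 - (1 - alpha_family)^(1 / num_tests)

# Family-wise error rate achieved when every test uses the corrected threshold
fwer_bonferroni <- 1 - (1 - bonferroni_alpha)^num_tests
fwer_sidak <- 1 - (1 - sidak_alpha)^num_tests

data.frame(num_tests, bonferroni_alpha, sidak_alpha,
           fwer_bonferroni, fwer_sidak)
```

Bonferroni is slightly conservative (its family-wise rate falls just under 0.05), while Šidák hits 0.05 exactly under independence.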
# Perform multiple t-tests on the trout data,
# comparing mass between each pair of lakes
# Get the unique lakes
sites <- unique(trout_data$lake)
num_sites <- length(sites)
num_comparisons <- num_sites * (num_sites - 1) / 2
# Data frame to store results
results <- data.frame(
  comparison = character(num_comparisons),
  p_value = numeric(num_comparisons),
  stringsAsFactors = FALSE
)
# Perform pairwise t-tests
counter <- 1
for (i in seq_len(num_sites - 1)) {
  for (j in (i + 1):num_sites) {
    site_i_data <- trout_data$mass_g[trout_data$lake == sites[i]]
    site_j_data <- trout_data$mass_g[trout_data$lake == sites[j]]
    test_result <- t.test(site_i_data, site_j_data)
    results$comparison[counter] <- paste(sites[i], "vs", sites[j])
    results$p_value[counter] <- test_result$p.value
    counter <- counter + 1
  }
}
# Apply different p-value adjustments
results$bonferroni <- p.adjust(results$p_value, method = "bonferroni")
results$holm <- p.adjust(results$p_value, method = "holm")
results$BH <- p.adjust(results$p_value, method = "BH") # Benjamini-Hochberg
# Display results, rounded to 4 decimal places
results %>%
  arrange(p_value) %>%
  mutate(across(where(is.numeric), \(x) round(x, 4)))
       comparison p_value bonferroni   holm     BH
1 NE 12 vs Toolik  0.6718     0.6718 0.6718 0.6718
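With only two lakes there is a single comparison, so all three adjustments leave the p-value unchanged. To see how the methods differ when there are several tests, the sketch below applies them to four made-up p-values (`p_raw` is hypothetical, not a result from the trout data).

```r
# Four hypothetical raw p-values from a family of tests
p_raw <- c(0.01, 0.02, 0.03, 0.04)

adjusted <- data.frame(
  p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),  # p * m, capped at 1
  holm       = p.adjust(p_raw, method = "holm"),        # step-down Bonferroni
  BH         = p.adjust(p_raw, method = "BH")           # controls false discovery rate
)
adjusted
#   p_raw bonferroni holm   BH
# 1  0.01       0.04 0.04 0.04
# 2  0.02       0.08 0.06 0.04
# 3  0.03       0.12 0.06 0.04
# 4  0.04       0.16 0.06 0.04
```

At a 0.05 threshold, Bonferroni and Holm each keep one result as significant, while Benjamini-Hochberg keeps all four, which reflects its less strict target (false discovery rate rather than family-wise error rate).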
# First, let's look at the data as a table
mice_summary <- trout_data %>%
  group_by(lake) %>%
  summarize(
    n = n(),
    mean_mass = mean(mass_g),
    sd_mass = sd(mass_g),
    min_mass = min(mass_g),
    max_mass = max(mass_g)
  )
mice_summary
# A tibble: 2 × 6
  lake       n mean_mass sd_mass min_mass max_mass
  <chr>  <int>     <dbl>   <dbl>    <dbl>    <dbl>
1 NE 12    322      534.    520.     9        2320
2 Toolik   249      518.    373.     0.15     3400
According to Tufte (2001), good scientific graphics:
# Let's create a plot showing several layers of information
pine_summary <- pine_data %>%
  group_by(group) %>%
  summarize(
    mean_length = mean(len_mm),
    sd_length = sd(len_mm),
    n = n()
  ) %>%
  mutate(se_length = sd_length / sqrt(n),
         conf_low = mean_length - qt(0.975, n - 1) * se_length,
         conf_high = mean_length + qt(0.975, n - 1) * se_length)
pine_summary
# A tibble: 4 × 7
  group       mean_length sd_length     n se_length conf_low conf_high
  <chr>             <dbl>     <dbl> <int>     <dbl>    <dbl>     <dbl>
1 cephalopods        18        3.86    12     1.11      15.5      20.5
2 crayfish           18        3.86    12     1.11      15.5      20.5
3 salmon             16.3      3.94    12     1.14      13.8      18.8
4 snail              18.3      2.27    12     0.655     16.9      19.8
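The `conf_low` and `conf_high` columns are t-based 95% confidence limits: the mean plus or minus `qt(0.975, n - 1)` standard errors. As a quick check of that arithmetic, the sketch below computes the interval by hand for one simulated group and compares it with the interval from `t.test()`. The vector `len_mm` here is simulated and only borrows the column name.

```r
# One hypothetical group of 12 lengths (simulated, not the lecture data)
set.seed(3)
len_mm <- rnorm(12, mean = 18, sd = 4)

# Manual t-based 95% confidence interval for the mean
n <- length(len_mm)
se <- sd(len_mm) / sqrt(n)
manual_ci <- mean(len_mm) + c(-1, 1) * qt(0.975, df = n - 1) * se

# The same interval as reported by the one-sample t-test
builtin_ci <- as.numeric(t.test(len_mm)$conf.int)

rbind(manual = manual_ci, t_test = builtin_ci)
```

The two rows agree, which confirms that the summary table is reporting standard one-sample t intervals for each group.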
To make good graphics:
# Let's create two versions of the same plot
# First, a "poor" version with low data-ink ratio
library(ggthemes)
p1 <- ggplot(trout_data, aes(x = lake, y = mass_g)) +
  geom_bar(stat = "summary", fun = "mean", fill = "lightblue",
           color = "black") +
  geom_errorbar(stat = "summary", fun.data = "mean_se", width = 0.5) +
  # theme_excel() +
  labs(title = "Average Trout Mass by Sampling Site",
       subtitle = "This plot has a low data-ink ratio",
       x = "Sampling Site", y = "Average Mass (g)")
Common problems in graphics:
Let’s create some plots with ggplot2 using the trout data:
# Basic scatter plot
ggplot(trout_data, aes(x = lake, y = mass_g)) +
  geom_point() +
  labs(title = "Trout Mass by Lake",
       x = "Lake", y = "Mass (g)")
# Scatter plot with grouping and trend line
# Note: lake is categorical, so the lm trend line is not meaningful here;
# geom_smooth() needs a numeric x variable to fit a slope
ggplot(trout_data, aes(x = lake, y = mass_g, color = lake)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Trout Mass by Lake",
       x = "Lake", y = "Mass (g)") +
  theme_minimal()
Key points about multiple testing:
Principles of good graphics:
When applying multiple testing corrections:
Alternatives to multiple pairwise tests:
- ANOVA with post-hoc tests
- Planned comparisons
- Multilevel models
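As a minimal sketch of the first alternative, the example below fits a one-way ANOVA and follows it with Tukey's HSD post-hoc comparisons, which adjust for all pairwise tests at once. The three sites and their masses are simulated for illustration; with real data the usual assumption checks still apply.

```r
# Three hypothetical sites; site C has a higher mean by construction
set.seed(11)
sim <- data.frame(
  site = factor(rep(c("A", "B", "C"), each = 30)),
  mass = c(rnorm(30, mean = 100, sd = 15),
           rnorm(30, mean = 100, sd = 15),
           rnorm(30, mean = 130, sd = 15))
)

# Omnibus test: is there any difference among the site means?
fit <- aov(mass ~ site, data = sim)
summary(fit)

# Post-hoc: all pairwise differences with family-wise adjusted p-values
tukey <- TukeyHSD(fit)
tukey
```

The `p adj` column of the Tukey output already accounts for the number of comparisons, so no separate `p.adjust()` step is needed.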
In this lecture, we’ve:
Key takeaways:
Things that stood out
What does not make sense or what questions do you have…
What makes you nervous?